arXiv:1502.07028vl [cond-mat.str-el] 25 Feb 2015 


Parallelized Traveling Cluster Approximation to Study Numerically 
Spin-Fermion Models on Large Lattices 

Anamitra Mukherjee^, Niravkumar D. Patel^, Chris Bishop^, and Elbio Dagotto^’^ 
^Department of Physies and Astronomy, The University of Tennessee, Knoxville, Tennessee 37996, USA and 
‘^Materials Seienee and Teehnology Division, Oak Ridge National Laboratory, Oak Ridge, Tennessee 37831, USA 

(Dated: February 26, 2015) 

Lattice spin-fermion models are important to study correlated systems where quantum dynamics 
allows for a separation between slow and fast degrees of freedom. The fast degrees of freedom are 
treated quantum mechanically while the slow variables, generically refereed to as the “spins”, are 
treated classically. At present, exact diagonalization coupled with classical Monte Carlo (ED+MC) 
is extensively used to solve numerically a general class of lattice spin-fermion problems. In this 
common setup, the classical variables (spins) are treated via the standard MC method while the 
fermion problem is solved by exact diagonalization. The “Traveling Cluster Approximation” (TCA) 
is a real space variant of the ED+MC method that allows to solve spin-fermion problems on lattice 
sizes with up to 10^ sites. In this publication, we present a novel reorganization of the TCA algorithm 
in a manner that can be efficiently parallelized. This allows us to solve generic spin-fermion models 
easily on 10^ lattice sites and with some effort on 10^ lattice sites, representing the record lattice 
sizes studied for this family of models. 


I. INTRODUCTION 

The rich physical properties displayed by many mate¬ 
rials arise from strong correlations among multiple de¬ 
grees of freedom mm- Studying theoretically these ma¬ 
terials has been a long standing challenge for materials 
theory since treating those coupled multiple degrees of 
freedom (DOE), such as the spin, charge, orbital, and 
lattice, on equal footing in a model Hamiltonian calcu¬ 
lation is extremely difficult. As a consequence, accurate 
approximations that render such complex problems more 
tractable have always been of considerable interest. Dy¬ 
namical Mean Eield Theory [3], Determinant Quantum 
Monte Carlo HE], and the Density Matrix Renormal¬ 
ization Group [7] are some of those approximations that 
have led to important insights into the physics of corre¬ 
lated materials. Another useful approximation is to ex¬ 
ploit the relative slow dynamics of some degrees of free¬ 
dom as compared to others. As discussed below, this 
approach allows for the modeling of some complex ma¬ 
terials with relative ease and on reasonably larger lattice 
sizes. 

In materials such as the manganites mm, double per- 
ovskites nni, rare earth nickelates [TTJ [12], and others, 
the slow and fast separation is a good approximation. 
Eor example, in the manganites, the electrons in the eg 
orbitals have faster dynamics as compared to the dynam¬ 
ics of the localized t 2 g electrons and also compared to the 
Jahn-Teller and breathing mode phonons [ills]. This 
allows for a separation between “fast” and “slow” DOE. 
The quantum+classical approach treats the slow vari¬ 
ables in the strict adiabatic limit, z.e., classically. Gener¬ 
ically the slow variables that are considered classically 
are called “spins” and for this reason the models are com¬ 
monly referred to as “spin-fermion” models. 

The main advantage of this approximation is that the 
original fully interacting quantum many body problem 
can be mapped into a problem of noninteracting fermions 


coupled with, in general, spatially fluctuating classical 
fields. In the past, such classical+quantum approaches 
have been extensively used. Some well known methods in 
this context include the study of electron-phonon systems 
[3 [ID, the Born-Oppenheimer approximation [16] ? 
the Gar-Parrinello method m- Spin-fermion models for 
the manganites dSHHHH], double perovskites [25l [26] , 
nickelates {27] [28] , copper based high temperature super¬ 
conductors [291133] . BGS superconductors [34H37] . and 
the recently discovered iron superconductors jSSHlS] have 
all exploited the slow and fast variables to considerable 
success. 

Solving such spin-fermion models entails the search 
for the optimal configurations of the classical DOE that 
minimize the free energy. To achieve this goal, first the 
fermionic probiem is diagonaiized for a fixed configura¬ 
tion of the ciassicai DOE and the energy is computed. 
The ciassicai variabies are then updated and the energy 
is recaicuiated in the updated background. The updates 
are accepted or rejected via the Metropoiis aigorithm. 
Einally the procedure is repeated untii thermai equiiib- 
rium is reached and observabies can be measured with 
reasonabie accuracy. 

This Exact Diagonaiization + Monte Garlo (ED+MG) 
approach is free from the “sign probiems” suffered by 
Quantum Monte Garlo methods and it can also include 
the study of long range spatial correlations unlike simple 
DMET approaches. Over the years the ED+MG method 
has enjoyed considerable success in understanding corre¬ 
lated materials phenomena where the separation of slow 
and fast DOE is possible m- However, even after the 
considerable numerical simplification due to the quan- 
tum+classicai treatment, the ED for the fermion probiem 
stiii has to be carried out at every update of the ciassicai 
fieids resuiting in thousands of diagonaiizations at every 
temperature where the caiculation is performed. Eurther- 
more, the simuiated anneaiing from high to iow temper¬ 
atures, which is often required to avoid being trapped in 
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metastable states, requires sequential temperature steps. 
All these steps amount to a prohibitively large number of 
diagonalizations to be performed in a standard ED+MC 
calculation. This typically limits the accessible lattice or 
system sizes that can be solved using ED+MC to ^ 10^. 

The ability to solve such spin-fermion problems on 
larger systems is needed to address issues such as large 
length scale phase separation tendencies, to achieve accu¬ 
rate estimations of thermodynamic order and transport 
properties with small size effects, and to be able to per¬ 
form reliable finite-size scaling analysis. Moreover, stud¬ 
ies of the iron based superconductors have pointed out 
the need to study spin-fermion and Hubbard-like models 
incorporating multiple orbitals [441448] . This task is chal¬ 
lenging even on small system sizes due to large Hilbert 
spaces. In these regards the simple “Traveling cluster 
approximation” (TCA) [49] is an important step forward 
as it allows access to system of ^ 10^ sites. This ap¬ 
proximation is discussed below. In this publication, we 
present an alternate way to organize the TCA algorithm 
that allows for massive parallelization of the method. As 
a result, the calculation of spin-fermion models can now 
be performed on system sizes up to ^ 10^ sites. Addi¬ 
tionally, as discussed later, in the present generalization 
very large traveling clusters can be used for the TCA cal¬ 
culation. The small size of the traveling clusters has, till 
now, remained a limitation of the TCA approach. 

Below we describe the parallelization scheme and 
benchmarks that we developed. As discussed in the text, 
techniques of the nature developed here will be instru¬ 
mental in addressing problems in multiband Hubbard 
models as well. 

The paper is organized as follows. In section H, we ex¬ 
plain the TCA technique and compare it with ED+MC. 
In section HI, we discuss our approach for parallelizing 
the TCA algorithm. In section IV, we present bench¬ 
marking results and in section V we provide some phys¬ 
ically relevant results for the one orbital Hubbard model 
both in two and three dimensions and compare them with 
existing literature. In sections VI and VH, we discuss 
some pertinent numerical issues and in section VHI, we 
present the conclusions of the manuscript. 


II. TRAVELING CLUSTER APPROXIMATION 

Let us begin by briefly discussing the basics of the 
ED+MC and TCA approaches. 

a. ED+MC : As mentioned before, the spin-fermion 
model consists of a classical component and a quantum 
component. A Hamiltonian for spin-fermion models de¬ 
fine the coupling between the classical DOE and the 
electrons and among the classical variables themselves. 
In usual ED+MC approaches, the classical variables at 
each site are updated one at a time and the energy of 
the system is calculated by diagonalizing the Hamilto¬ 
nian and adding the classical contribution. This energy 
difference, before and after the update at a site, is used 
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FIG. 1. Two dimensional schematic of the TCA approach. 
Here a iV = 8^ lattice is displayed with classical DOF at each 
site, represented by the red arrows. The Hamiltonian defines 
the coupling of the classical spins with the itinerant electrons 
that are delocalized on the lattice. The TCA algorithm con¬ 
sists of proposing an update at a site (encircled in blue) for 
the classical spins. The update is accepted or rejected based 
on the energy of a cluster built around that site, indicated in 
green. Here the cluster size is Ac = 9. In a system sweep, 
the above procedure is carried out by visiting each site of the 
system sequentially. 


to accept or reject the proposed update. The process is 
then repeated over all of the sites visiting them either 
serially or randomly. This constitutes a single system 
“sweep”. The combined algorithm of ED+MC is numer¬ 
ically rather costly, since the exact diagonalization must 
be performed at every step and the cost scales as 0{N^) 
with N the number of lattice sites. Additionally, with 
a sequential system sweep, the cost of a Monte Carlo 
system sweep scales as at each temperature. 

b. TCA scheme: To reach reasonably large system 
sizes, a real space variant [49] of the ED+MC approach 
has been developed. As will be discussed below, this 
allows for a linear scaling with the system size. A, as 
opposed to the A^ scaling of the computational cost with 
ED+MC. 

In the TCA scheme one defines a region (cluster) 
around the site where a MC update is attempted. The 
cluster has a linear dimension Lc- Then, in a two dimen¬ 
sional square lattice, for example, the number of sites in 
the cluster is Ac = Such a cluster is shown in Eig. 
The cluster is built around a site called the “update” 
site that it is encircled in blue. In this example Ac = 9. 
The key difference with ED+MC is that the proposed 
update is accepted or rejected on the basis of the energy 
difference of the cluster and not the full system. As a 
result one needs to diagonalize only the cluster Hamilto¬ 
nian which costs A^ as opposed to A^ for the full system 
diagonalization in ED+MC. 

The analytical basis for the approximation of using a 
smaller cluster for the annealing process lies in the prin- 
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FIG. 2. The flowchart for a single sequential Monte Carlo 
system sweep in the TCA approach. Here, we consider a 
lattice of N sites. S[ ] is an array containing the classical 
DOF at all the N sites. C[is] is an array of size Nc that 
reads in the classical DOF from S[ ] around the “update site” 
labeled by A- The loop over is runs over all the N sites of the 
system, ensuring that the cluster is built around all the sites 
sequentially and that an update is proposed at each site. The 
flowchart is discussed in the text. 


ciple of “nearsightedness” of electronic matter, as dis¬ 
cussed by W. Kohn [Ml ED- Furthermore, the method 
has been extensively tested and benchmarked in numer¬ 
ical studies [20l |52] . Consequently, in this paper we will 
assume the validity of the approximation without fur¬ 
ther discussion. The above update scheme is sequen¬ 
tially employed at every site of the system. The cluster 
of size Nc is built around every site where the update is 
attempted, hence the name “Traveling” cluster approx¬ 
imation. Thus, within TCA, and at each temperature, 
the computational cost of ED for a system with N sites is 
0{N^) and the cost of a full sweep of the lattice is NN^ 
or linear in N as opposed to N^. 

Many thousand MC system sweeps are performed at 
every temperature and for each temperature a large num¬ 
ber of annealed classical configurations are stored. These 
are later used to construct and diagonalize the full sys¬ 


tem if it is necessary for calculating the desired output 
quantities. They are also useful for studying correlations 
among the classical variables. 

As discussed in the original TCA paper m, the ge¬ 
ometry of the cluster is chosen to be the same as the 
system. Furthermore, one has to impose periodic bound¬ 
ary conditions on the cluster while calculating energies. 
These conditions ensure that in the limit of cluster sizes 
approaching the actual system size, the spectrum be¬ 
comes identical. The periodically identified cluster can 
be considered to be an independent ensemble in contact 
with the full system where equilibrium is maintained in 
a grand canonical framework. An important aspect of 
this setup is that any site on the cluster can be chosen 
as the “update site”. In Fig. we show this update site 
to be equidistant from all the edges. However, any other 
site, for example, the one in the top left corner, is also 
a equivalently good choice. This equivalence has been 
tested in many numerical studies [20l [4^ |52] and we also 
checked numerically the same concept in the context of 
the Holstein model in section VII. 

c. TCA flowchart: We end this section with the TCA 
algorithm, and the corresponding flowchart is presented 
in Fig. In the flowchart the following nomenclature is 
used, and the same will be used for discussing the PTCA 
approach as well. We consider a system with N sites. S[ ] 
is an array containing the classical variables at each site 
and it has the length V, assuming one classical variable 
per site. Hfuii{S[ ]} is the Hamiltonian for the full sys¬ 
tem, generated from the classical variables in S[ ]. The 
array C[is] is of length Nc and it is the array holding the 
classical DOF at the cluster sites built around the site 
of the system. So this array reads the relevant part of 
S[ ]. For example, in Fig. [^(7[is] will read in, from S[ ], 
all classical variables that are at the lattice sites covered 
in the green square. From this setup the cluster Hamil¬ 
tonian, Hcius{C[is]}^ is constructed around the il^ site. 
The site, where the update is proposed and around 
which the cluster is built, is referred to as the “update 
site.” With these notations, the following are the main 
steps explaining the TCA flowchart presented in Fig. 

1. In a single Monte Carlo system sweep, the index ig 
loops over all N sites of the full system. Around 
each of the sites, a cluster will be built one at 
a time, traveling sequentially, as is sweeps over the 
full lattice. 

(a) C[is] reads part of 5'[ ] around the site is. 

(b) Hcius{C[is]} is generated and diagonalized. 

(c) The classical DOF is randomly modified at 
site is. 

(d) Hcius{S[ ]} is generated with the update and 
rediagonalized. 

(e) A Metropolis algorithm decides if the pro¬ 
posed update is accepted or not. 

(f) If accepted the element in S'[ ] is changed 
to the updated value. 
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FIG. 3. One dimensional example using N = S and A^c = 4. 
The light blue sites are the un-updated sites. The red site is 
chosen to be the site where the update is attempted (update 
site). The clusters are indicated by the boxes. The different 
rows present the cases of the cluster traveling sequentially 
from left to right during a single system sweep. The blue 
sites indicate the sites where an update has been attempted. 
The system and cluster both have periodic boundary con¬ 
ditions. For rows 1 through 5, the clusters are built based 
on the original (un-updated) classical variable configurations 
and can be constructed simultaneously instead of serially as 
in TCA. For rows 6 through 8, the clusters require the results 
of the update attempts on sites 1 through 3, respectively. So 
these clusters have to wait till row 4 and then can be built 
simultaneously. 


2. The above process is repeated for all sites of the 
system. 


III. SCHEME FOR PARALLELIZATION 

We now illustrate that it is possible to further reor¬ 
ganize the TCA algorithm to achieve parallelization. 
For this we will use Message Passing Interface (MPI) 
parallelization. 

a. PTC A scheme: In Figure a one-dimensional 
lattice example is used to illustrate the parallelization 
scheme of TCA. In the figure we show a A = 8 site sys¬ 
tem with a Ac = 4 site traveling cluster indicated by a 
rectangle. The update site is marked in red. Let us start 
with some initial values for the classical DOF at all the 
sites. The sites where an update has not yet been pro¬ 
posed are displayed in light blue. The rows from top to 
bottom indicate the different steps of a single MC sys¬ 
tem sweep where the update site traverses from left to 
right sequentially. The sites where the update have been 
attempted are colored in blue. Here for simplicity of pre¬ 
sentation, we discuss the case where the ‘update site’ is 
the site on the extreme left. Other choice of update sites 
are discussed in Sec VII. From the figure it is easy to see 
that assuming the MC system sweep starts from row one, 
the clusters in the first five rows do not depend on the 
update of the previous row. Rows six, seven, and eight 


depend on the outcome of the update attempts at sites 
one; one and two; and one two and three, respectively. 
In general, there are A — Ac + 1 clusters of the former 
kind and Ac — 1 of the later kind. We refer to the later 
kind as “boundary clusters”. 

In TCA the cluster diagonalization involved in all of 
the eight steps are carried out sequentially. The obvious 
way to parallelize the TCA is to diagonalize the indepen¬ 
dent sets of clusters in parallel. The simplest strategy for 
this is to divide the MC system sweep into two blocks, 
L'^ and I/ 2 , each consisting of four of the eight steps of 
the MC sweep. It is easy to see that in this way all the 
four clusters in the block L'^ can be diagonalized on four 
processors simultaneously. Once the updated results for 
L'^ are received, they are used to generate the clusters 
for the 1/2 block which can now be diagonalized in paral¬ 
lel. Thus the computation cost is 2A^ rather than AA^ 
as in TCA. For a d dimensional cubic system, the cost 
of PTC A scales as 2^A^. The 2^ factor comes from the 
correct accounting of all boundary clusters that can not 
be diagonalized simultaneously. This still is very advan¬ 
tageous as compared to the AA^ scaling of TCA. 

In our approach, the one dimensional global system is 
broken into two blocks, each having ^ number of sites. 
In two dimensions, the global square system is broken 
into four blocks and into eight in three dimensions. 

h. PTC A flowchart: We now discuss the implementa¬ 
tion of PTC A. The flowchart is presented in Fig. I^and it 
is discussed below. For simplicity we discuss specifically 
the one dimensional case, but a generalization to higher 
dimensions is straightforward. 

For PTC A we will use Ap+1 processors with ranks 0 to 
Np. As discussed below, of these the rank=0 processor is 
the master and is involved only in receiving and sending 
information, while the rest Np processors are the ones 
that will be used for diagonalization of clusters. As in 
the TCA case, we define 5'[ ] as the array holding all the 
classical DOF for an A site one dimensional system. As 
discussed above, we divide the system into two blocks, 
the loop label running over the blocks is “K”. Within 
each block, another loop, labeled by “I”, runs from one 
to Ns/Np. Here Ns is the number of sites within a 
block and we ensure that Ns/Np is an integer. Note 
that if Ns = Np the clusters built at all the Ns sites can 
be diagonalized in one go. In PTCA the “update site”, 
denoted by P, is a function of A^-, Ap, and the rank of 
the processor on which the cluster built around P is to 
be diagonalized. Thus the update site P is denoted by 
P(A, /, Rank) in the flowchart. C[P] is the cluster built 
around the update site P{K^ Rank). Hfuii{S[ ]} and 
Hcius{C[P]} have definitions similar to that for TCA. 

The MPI commands used are standard [53] and will 
not be repeated here in detail. We use MPI_Init, 
MPI_Coiiim_size, MPI_Comm_rank to allocate and assign 
labels (ranks) to Ap + 1 number of processors. The ranks 
of the processors range from 0 to Ap. 

1. Loop over blocks K (= 1, 2) for our one dimensional 
example. 
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FIG. 4. Parallelized TCA flowchart for a single system sweep. Unlike Fig. the loop over the total number of sites of the 
system is split into two loops: the outer one runs over the number of blocks, while the inner one runs from 1 to the ratio of 
the number of sites in a block (Ns) to the number of processors (Np). The gray rectangles, labeled PCi, PC 2 , PCnp, are 
computed on processors with rank 1 through Np, respectively. The steps in a typical gray block are displayed on the left. See 
text for discussion. 


2. Loop over I goes over 1, 2, Ns/Np. 

3. For each I, assign the construction of the cluster 
around the “update site” P{K, /, Rank = R) and 
the subsequent update procedure to the processor 
with Rank = R^ {R > 0). The Np such assign¬ 
ments are depicted with small gray Np rectangles 
in Fig. 1^ 

4. For a processor with Rank = R, {R> 0), C[P] will 
read the relevant Nc site classical DOF data from 
S[]. It will then diagonalize Hcius{C[P]} before 


and after proposing an update for the classical vari¬ 
able at P. If accepted, the update of the P^^ site 
is sent to processor with Rank=0 using MPI_SEND. 
This is shown in the expanded gray rectangle on 
the left of Fig. 

5. Rank=0 processor receives update from all other 
processors with ranks 1 to Np using MPI_RECV and 
suitably modifies the 5'[ ] on Rank=0 processor. 

6. The “I” loop ends. 

7. The Rank=0 processor broadcasts the modified S[ ] 
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to all the processors using MPI_BCAST, once updates 
from all processors have been received. 

8. End the loop K. 

The parallelization scheme holds for any dimensions, 
as long as A^c < ^ which we guarantee by definition. 
For the simplest case of Ns = Np, the estimated total 
cost of Pi MC sweeps with P 2 full system diagonaliza- 
tion for output calculations is Pi2^A^^ + P 2 N^• This is 
a huge improvement in performance compared to TCA 
for which the computational cost for the same would be 
PiN X ^P2N^ . The improvement is significant when 
Pi is a very large number, which is always the case. In 
the next section we present actual results when Ns > Np 
that establishes that even in this case the reduction of nu¬ 
merical cost is significant. For a fixed Pi and P 2 , beyond 
a certain system size, the full system diagonalization will 
dominate the total computational cost for the PTCA if 
these are “done on the fly”. We suggest saving config¬ 
urations and performing the full system diagonalization 
separately. A strategy for this process using Scalable 
LAPACK is suggested in section VI. 


IV. NUMERICAL BENCHMARKS 

Let us now discuss benchmarks comparing TCA with 
PTCA. For this purpose we will use the following spin- 
fermion Hamiltonian: 




8 16 24 32 40 


L (N=P) 


-f^Hubb-MF = F 

i 

+ j M F 

i i 

This Hamiltonian is the SU(2) invariant Hartree-Fock 
mean field Hamiltonian for the Hubbard model. We have 
recently established [54] that if the mean field expecta¬ 
tion values in Pnubb-MF are treated as classical vari¬ 
ables and annealed via a classical MC process involving 
a slow reduction of the temperature, then the finite tem¬ 
perature results for all the observables we tested agree 
qualitatively and often quantitatively with Determinant 
Quantum Monte Carlo. In this model is the mean 
field magnetization at the site and it is treated as 
a classical vector, t is the hopping parameter and U is 
the Hubbard onsite repulsion. We further set (n^) = 1 
for the case of half filling. This model that involves free 
electrons interacting with the classical spins defines our 
spin fermion model. We will present results in two and 
three dimensions and compare with those obtained using 
ED+MC and TCA in our earlier work [54]. The meth¬ 
ods used in [54] have also been independently derived 
and applied in the context of the Hubbard model on an 
anisotropic triangular lattice [55] and on geometrically 


FIG. 5. (a) Time required for asynchronous ED and full 

ED with message passing against the number of processors 
Vp at a fixed U = 8.0t. See text for definitions. The data is 
presented for a V = 8^ lattice using 2000 MC system sweeps 
at a fixed temperature. The dashed line is the plot of 1/Np; 
(b) Time needed for 200 system sweeps for the full calculation 
with message passing plotted against L, for a system size 
N — . The data for different number of processors, Vp, are 

shown. Np values are indicated on the right. The dashed line 
indicates that the computational cost for solving a V = 32^ 
system using PTCA with 64 processors is almost the same 
as the time needed using TCA on an 8^ system with a single 
processor. Calculations were done using multiple Intel Xeon 
E5-2670 processors which have eight cores with base frequency 
of 2.6 Ghz and 8 GB RAM per processor. 


frustrated face centered cubic lattices [56]. Earlier simi¬ 
lar approaches but for the attractive Hubbard interaction 
(negative U) where reported in Refs. [34ll37l [57ll59] . 

For the results presented in the present paper, the TCA 
cluster size used is Nc = 4^ in three dimensions and Nc = 
4^ in two dimensions. We also checked the independence 
of our results to variations in the traveling cluster size. 

For 8i L X L X L system in three dimensions, with a 
total number of sites N = the matrix size of the 
Hamiltonian for a given configuration of classical fields 
{m} is 2N X 2N. The factor of two comes from the two 
spin species of the fermions. Figure [^ is the main nu- 
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merical result that establishes (i) the efficiency of PTC A 
over TCA and (ii) the dependence of the performance 
of the PTC A on the number of processors Np. In (a) 
we display the bare time, without focusing on measuring 
physical results, for a = 8^ lattice in three dimensions 
with cubic geometry. We further choose Nc = 4^. We 
have performed 2000 MC system sweeps at a fixed tem¬ 
perature, that amounts to 8^ x 2000 or 1.024 x 10^ exact 
diagonalizations of 2 x 4^ matrices defining the travel¬ 
ing clusters. These are performed using the PTCA ap¬ 
proach and employing different numbers of processors. 
The corresponding time needed is plotted in blue against 
Np. Since large Np increases the communication time 
between the processors, for comparison in (a) we have 
also shown the time required to diagonalize the same 
number of matrices but with no interprocessor communi¬ 
cation, labeled as asynchronous. This is indicated in red. 
Also the curve 1/Np (the dashed line) establishes that 
the time needed for the asynchronous ED varies as 1/Np 
within the PTCA scheme. When the processors commu¬ 
nicate (using MPI_SEND, MPI_RECV and MPI_BCAST), the 
time increases with Np. But adds only a few seconds of 
additional time even for Np = 32. This is labeled as Full 
in Fig.[^(a). 

In the previous section we had estimated that the cost 
of a system sweep in PTCA is Pi2^A^^, for Pi diagonal¬ 
izations of the traveling cluster. However, this assumed 
that all independent traveling clusters in one block can be 
diagonalized simultaneously. Since the system sizes can 
be very large, this is seldom possible. As a result only a 
fraction of traveling clusters in one block can be diagonal¬ 
ized simultaneously. It is easy to check that this would 
lead to the l/Np dependence seen in Fig.|^(a) apart from 
the additional processor communication time. In (b) we 
show the time needed for 200 MC system sweeps within 
PTCA against L, for a system. This is displayed 

for different Np values indicated on the right. From (b) 
it is clear that the time for the 200 system sweeps for a 
N = 8^ system with single processor (or within TCA) 
is almost equal to the PTCA cost for a 32^ system with 
Np = 32. 

With this clear advantage of PTCA, in the next section 
we discuss some particular physics results and compare 
them with existing literature. 


V. RESULTS FOR THE HUBBARD MODEL IN 
TWO & THREE DIMENSIONS 

In Fig. we discuss the magnetic properties of the 
Hubbard model in two and three dimensions by studying 
Hamiltonian Eq. (1) at finite temperature using PTCA. 
In our recent work we have extensively studied this sys¬ 
tem using EDMC and TCA [54]. At half filling the Neel 
temperature, T/v, has a non monotonic dependence on 
U. These results are presented here in two and three di¬ 
mensions. In Fig. (a) we plot T/v against U for three 
different system sizes, 4^, 16^, and 40^. These are all 



FIG. 6. (a) Tn vs. U/t for system sizes with 4^, 16^, and 40^ 
sites, (b) The spin structure factor S(7r, tt, tt) vs. temperature, 
for system sizes 4^,6^, 16^, and 40^ at U = 8.0t. The results 
are obtained using a 4^ traveling cluster and 16 processors. 
The data is averaged over 200 configurations obtained from 
2000 MC steps for the cases A/" = 4^ and A = 6^, while only 
1250 MC steps were used for A = 16^ and A = 40^ with an 
average over 50 configurations. The magenta arrow indicates 
the thermodynamic T/v in this case, (c) and (d) present the 
corresponding results in two dimensions, (c) Tn vs. U/t and 
(d) the corresponding S(7r, 7r)’s for system sizes 8^, 100^, and 
256^. The results are obtained using a P traveling cluster 
and 16 processors. 

obtained using PTCA. The 4^ results are identical to the 
4^ results in our earlier work. For the larger system sizes 
studied here we find that Tn converges and it has a weak 
dependence on the finite size of the system. We em¬ 
phasize that until now in the literature there have been 
no results for spin-fermion models employing such large 
number of sites. We use these large system values of T/v 
to perform finite size scaling. In Ref. [54] we had estab¬ 
lished that the magnetic structure factor obtained with 
the TCA agrees with the ED+MC data at all temper¬ 
atures. This indicates that finite size effects associated 
with the cluster size do not affect the finite temperature 
evolution of the magnetic state appreciably. Thus, the 
finite size scaling using TCA or PTCA is justified. 

For these results on finite systems, the bulk T/v es¬ 
timates are obtained by an inspection of the 5'(7r,7r,7r) 
data shown in (b). Information regarding the Neel AFM 
order is obtained from the magnetic structure factor for 
the m variables, 

= ( 2 ) 

hi 

where q = {tt, tt, tt} is the wavevector of interest. 












Then, assuming that the correlation length ^(T/v(T) — 
j^Thermo^ _ aL OH a. systcm, and given that 
^{x) oc one arrives at the scaling form, Tn{L) = 

j^Thermo , Here, L denotes data from a system 

size N = L^. We plot the finite system Neel tempera¬ 
tures against 1/L and use Tj^hermo^ v as fitting 

parameters. Details of this process are presented in our 
earlier work, and here in Fig.[^(a) we simply present the 
results (green diamonds). We have found that indeed 
both the 16^ and 40^ results converge to the true ther¬ 
modynamic Neel temperature. For the antiferromagnetic 
structure factors for different system sizes in (b), we find 
that the PTC A results for N > 16^ are virtually identical 
and appreciably better than the 4^ and 6^ results which 
have non-negligible finite size effects. The arrow in (b) 
indicates the thermodynamic T/v as obtained from finite 
size scaling for the case U/t = S. 

In (c) and (d) we show the corresponding results in 
two dimensions. Here, as it is well known, in principle 
the Mermin-Wagner theorem establishes that there is no 
true T/v in two dimensions for an 0(3) magnet. How¬ 
ever, this theorem is valid only for short-range spin-spin 
interactions. In our case, the integration of the fermions 
leads to effective spin-spin interactions at all distances, 
although the rate of the decay of the couplings with dis¬ 
tance is unknown. Panels (c) and (d) indicate that the 
N = 100^ and N = 256^ results, while significantly lower 
than the 8^ result, are very close to each other suggest¬ 
ing convergence. However, this subtle matter requires 
further discussion and larger clusters to be fully under¬ 
stood and our goal in this section is merely to check the 
performance of the proposed PTCA method. The clarifi¬ 
cation of the validity of the Mermin-Wagner theorem for 
spin-fermion models is left for the future. 


VI. DIAGONALIZATION OF FULL SYSTEM 

In this section we will discuss the strategy for diagonal¬ 
izing large full systems to calculate fermionic observables 
that in principle require all eigenvalues and all eigenvec¬ 
tors for each configuration of classical variables. In the 
PTCA scheme, given the large matrix sizes for the full 
system, we find that it is best to first simply anneal the 
classical variables, then store many equilibrium configu¬ 
rations at each temperature generated during the Monte 
Carlo process, and then at the end perform full system 
diagonalizations to calculate the fermionic observables 
separately. In the special cases where we are interested 
only in the correlation among classical variables of course 
we can certainly measure those correlations for each MC 
configuration. But for the fermionic observable cases that 
require, e.g., full Green functions we suggest using Scal¬ 
able LAPACK for the parallel diagonalization of the full 
system using the equilibrium configurations. 

Let us assume Nq are the number of processors used 
for diagonalizing the large matrices employing Scalable 
LAPACK. Figure (a) shows the memory required to 


store all of the arrays that are necessary to diagonalize 
a large double complex hermitian matrix. It should be 
noted here that the Hamiltonian as one complete array 
is never created on an individual processor. Instead, the 
Hamiltonian is evenly spread out in blocks among all of 
the processors. This greatly reduces the total RAM re¬ 
quired as well as the RAM per processor [60]. The total 
RAM and number of processors for a given system is a 
constant, therefore the ratio of RAM per processor is a 
fixed quantity. For example, in the traveling cluster used 
for these calculations, every job (run) submitted is allo¬ 
cated 2 Gb per processor. If one job uses more RAM than 
this, some processors can not be used since they do not 
have memory available to them. Therefore, when diag¬ 
onalizing large matrices using the number of processors 
that approaches this fixed ratio will optimize the CPU 
time and memory usage. In (a) we see that, as expected, 
the memory requirement grows with matrix and system 
size, but reduces with increasing Nq. The Nq values of 
4, 8, and 16 are indicated in the figure. The gray re¬ 
gion in (a) is where memory needed per processor is 4Gb 
or less. For typical computational resources of multicore 
workstations this is easily available. This requirement 
corresponds to a system size of about 24^ sites. 

The other issue is the time needed for diagonalization. 
In Fig.[^(b) we present the typical time needed for single 
diagonalization corresponding to V = 16^ and 24^ sys¬ 
tem sizes or matrix sizes 8192 x 8192 and 27648 x 27648, 
respectively. The results are for Nq = 4, 16, 24, and 32 
processors. In both cases the time gain is quite significant 
with increasing Nq. If all the configurations over which 
the output quantities are to be averaged at a fixed tem¬ 
perature are calculated in parallel, then for Nq = 8 we 
require only about 0.1 hours of additional computation 
time for a V = 16^. For N = 24^ the additional time 
is about 2 hours. The additional time goes down further 
for larger Nq. High end workstations and small clusters 
should easily be able to supply the resources needed for 
such system sizes. 


VII. DISCUSSION 

In this section, we will discuss two related numerical 
issues and provide an estimate for the cost of solving 
spin-fermion models derived from multiorbital Hubbard 
models: 

a. Cluster size effeets: The first is the dependence of 
results on the traveling cluster size. In Fig.|^(a) we show 
the antiferromagnetic structure factor vs. temperature 
for a lattice with 32^ sites employing different sizes for 
the traveling clusters, using the same mean field Hubbard 
model discussed before. 

While small cluster sizes are good enough to capture 
the long range order as well as the rough location of the 
transition temperature for this model, the finite cluster 
sizes introduce finite size effects of its own. To reduce 
these, one needs to employ larger traveling clusters. In 
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FIG. 7. Memory and time needed for a single diagonaliza- 
tion using Scalable LAPACK. In (a) we show the RAM (in 
Gb) per processor that is required to store the arrays neces¬ 
sary to diagonalize a double complex Hermitian matrix. The 
gray region is defined by a limit of 4Gb RAM per proces¬ 
sor, which can diagonalize a Hamiltonian for a 24^ system 
(27648 dimensional matrix) with 8 processors. The data is 
also presented for 4 and 16 processors; (b) the time needed 
for a single diagonalization for 16^ and 24^ systems vs. the 
number of processors Nq used in Scalable LAPAGK. 


Fig. ^ (a) we see that finite size effects in T/v reduce 
rapidly with larger clusters, Nc = 12^ and 16^, for the 
same fixed system size. Furthermore in physical problems 
where there is long wavelength order, large Nc would be 
crucial to capture the correct phases. In TCA, the linear 
depencence of the numerical cost on the system size limits 
Nc to 8^. Larger Nc = 12^, 16^, results are only possible 
within the current scheme. We would like to emphasize 
that this is an additional significant improvement over 
TCA. 

b. Choice of the update site: In Fig.j^we had displayed 
a scheme for setting up the PTCA. There we had chosen 
the leftmost site of the cluster as the site where the up¬ 
date is attempted. Choosing this leftmost “update site” 
was mainly for convinience. Here we briefly demonstrate 
the effect of choosing other update sites. The paralleliza¬ 
tion, of course, applies to any such choice. In Figs. Kb) 
and (c) we show the comparison of results for different 
choices of the update site. For this purpose, we study the 
adiabatic Holstein model in one dimension at half filling. 
The Hamiltonian for this model is 


HhoI = T] ( 3 ) 

+ Y^{Xxi - iJ,){ni - 1) + K/2Y^x‘^, 

i i 

where rii = {rii^^ + ftie particle-hole symmetric 

adiabatic Holstein model, the classical variables {x} at 
every site denote the static lattice distortions. A is the 
electron lattice coupling and K regulates the elastic cost 
of the lattice deformation. In this model the goal is to 
determine the optimal configuration of the {x} variables 



FIG. 8. (a) Antiferromagnetic order, S(7r, tt), for the two di¬ 
mensional Hubbard model at U/t = 8 using a N = 32^ lat¬ 
tice with difference traveling cluster sizes as indicated, (b) 
The one dimensional checkerboard charge order parameter 
vs. temperature for the adiabatic Holstein model, (c) shows 
the corresponding average energy of the system with temper¬ 
ature. The data is presented for two schemes indicated as 
‘central’ and ‘corner’, see text for discussion. The parameters 
for (b) and (c) are the same. 


that minimizes the free energy. At half filling the model 
exhibits a checkerboard charge order together with large 
and small lattice distortions m- The charge order can 
be probed by plotting the structure factor for the classical 
variables. This is defined by 

= (4) 

hj 

where q = ir is the wavevector of interest. 

In our study two schemes were used: scheme ‘1’ where 
the update site is the leftmost site of the traveling cluster, 
and scheme ‘2’ where the {Nc/2 + 1)^^ site is the update 
site. In the one dimensional study with N = 32 and 
Nc = 8, the sites 1 and 4 are the choices for the update 
site and the two schemes are refered to as ‘corner’ and 
‘central’, respectively. We do not present the details of 
the algorithm for ‘central’ scheme here, which is very 
similar to the earlier scheme. We just mention here that 
one needs to choose a different way of distributing which 
clusters are to be diagonalized in parallel. The numerical 
advantage is comparable in both schemes. 

In Fig.|^(b) we study the correlation between the clas¬ 
sical variables. N{q = tt), as defined before, is plotted 
as a function of temperature. At low temperature an al¬ 
ternating large-small pattern generates a peak at g = tt 
in the charge structure factor. As seen in the figure, the 
results from both schemes match with each other. In 
addition in (c) we show the average energy with temper¬ 
ature, which also agrees over a wide temperature range. 

c. Numerical cost for multiorbital Hubbard model: To 
derive the general formula for numerical cost of PTCA for 
a multiorbital Hubbard system, we first note from Sec. 
IV, that we had divided the system into 2^ blocks. Thus 
Ns = V/2^, where Ns is the number of sites in a block. 
Secondly, since we build a cluster around each of those 
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Ns sites in a block, the time taken for a MC system sweep 
is simply the cost of a single cluster diagonalization times 
the number of blocks. To do so, however, requires us 
to diagonalize all Ns clusters in a block simultaneously. 
This would require Ns processors. Typically, for large 
systems, the number of processors A/’p, is much smaller 
than Ns. In such cases only Np number of clusters in a 
block can be diagonalized simultaneously. Thus the cost 
to complete the diagonalizations of all Ns clusters in a 
block would be Ns/Np times the cost of diagonalization 
of a single cluster. 

From these, it is easy to deduce that the cost for Pi MC 
steps in the PTC A as discussed in section IV, Pi2^V^, 
can be written as Pi{Nc)^ x As a consequence, the 
cost for Pi MC steps for No orbitals (with two spins 
per orbital), would be Pi(2VoA^c)^ x From this 

expression it can be shown that if V = Np then the cost 
of PTCA is the cost of Pi cluster diagonalizations. If 
Np = 1, then the cost grows linearly with V, which is 
precisely the case for TCA. Finally for a general Vp, the 
cost scales as 1/Np. 


VIII. CONCLUSIONS 

In conclusion, we have provided a reorganization of 
the TCA algorithm that allows for a straightforward 
parallelization. To test the method, we have presented 


results for the Hubbard model in two and three dimen¬ 
sions treated in the mean field approximation and for 
the Holstein model with classical lattice distortions in 
one dimension. A comparison with earlier work clearly 
shows that the PTCA approach can produce reliable 
results on very large lattices. Apart from accessing large 
system sizes for the case of the single orbital Hubbard 
model, the new approach will facilitate the study of finite 
temperature effects in multiorbital Hubbard models, 
treated in the mean field approximation, where the large 
orbital degeneracy (up to five orbitals in models for iron 
superconductors) severely limits the number of sites that 
can be solved employing ED+MC, even when including 
the TCA improvement. 
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